Statement of Contribution

Nazli Bilgic (nazbi056) was responsible for coding and writing analysis for assignment one.

Siddhesh Sreedor (sidsr770) was responsible for coding and writing analysis for assignment two.

We split the work, and after finishing our own parts we went through each other's work together so that both of us understand the whole lab.

Assignment 1: Visualization of mosquito populations

  1. Use MapBox interface in Plotly to create two dot maps (for years 2004 and 2013) that show the distribution of the two types of mosquitos in the world (use color to distinguish between mosquitos). Analyze which countries and which regions in these countries had high density of each mosquito type and how the situation changed between these time points. What perception problems can be found in these plots?

In 2004 the blue dots (Aedes albopictus) are concentrated mostly in the United States, with the densest clusters around Memphis and Jackson.

In 2013 there are no blue dots in these regions; Aedes albopictus is instead seen mostly in Italy and Taiwan.

In 2004 the red dots (Aedes aegypti) are gathered mainly in Brazil and Venezuela.

In 2013 the number of red dots in Brazil has increased.

Many points lie very close to each other, which causes overplotting. In such areas it is hard to tell whether there are many points or only a few, so a reader analyzing the point distribution on the map can be misled into wrong estimates of density.
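One rough way to quantify the overplotting described above is to count how many points fall into the same small cell. The sketch below uses base R only; the coordinates are hypothetical and not taken from the data set, and rounding to one decimal place is an arbitrary choice of cell size.

```r
# Hypothetical coordinates (longitude X, latitude Y), made up for illustration.
pts <- data.frame(
  X = c(-46.63, -46.64, -46.61, -34.88, -34.90, -60.02),
  Y = c(-23.52, -23.54, -23.51,  -8.04,  -8.02,  -3.10)
)

# Round to one decimal place so that nearby points share the same key.
key <- paste(round(pts$X, 1), round(pts$Y, 1))

# Number of points per cell, and how many points share a cell with another point.
cell_counts <- table(key)
overplotted <- sum(cell_counts[cell_counts > 1])

cell_counts
overplotted
```

Cells with a count above 1 are the places where dots are drawn on top of each other on the map.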

  2. Compute Z as the numbers of mosquitos per country detected during all study period. Use plot_geo() function to create a choropleth map that shows Z values. This map should have an Equirectangular projection. Why do you think there is so little information in the map?

The color scale is linear and has to span the whole range of Z, which is very wide: the count is 1 for ‘CAN’ and 8501 for ‘BRA’. Countries whose counts are this far apart are easy to tell apart by color on the map.

Countries with low counts (for example 1 versus 10) differ numerically, but on this scale their colors are almost identical, so the map shows very little information about most countries.
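The effect can be illustrated with a small calculation, given here as a sketch rather than as what plotly does internally. The counts 1 and 8501 come from the text above; the count 10 and the helper `rescale01` are hypothetical and only serve the example.

```r
# 1 (CAN) and 8501 (BRA) are from the text; 10 is a hypothetical low count.
z <- c(CAN = 1, hypothetical = 10, BRA = 8501)

# Hypothetical helper: position of each value on a 0..1 color scale.
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

linear_pos <- rescale01(z)       # position on a linear scale
log_pos    <- rescale01(log(z))  # position on a log scale

round(rbind(linear = linear_pos, log = log_pos), 4)
```

On the linear scale the count of 10 sits almost exactly where the count of 1 does, while on the log scale the two are clearly separated.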

  3. Create the same kind of maps as in step 2 but use
  a. Equirectangular projection with choropleth color log(Z)

Countries with many mosquitos are colored darker. Brazil has the highest count, and the map shows that the USA also has a large number of mosquitos.

Regions close to the poles appear noticeably enlarged compared to their true size.
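The size of this enlargement can be estimated with a short calculation. In an equirectangular projection the east-west direction is stretched by a factor of 1/cos(latitude) while the north-south direction is not, so areas are inflated by the same factor. The latitudes below are arbitrary example values.

```r
# Example latitudes in degrees (arbitrary choices for illustration).
lat_deg <- c(0, 30, 60, 80)

# East-west stretch factor of the equirectangular projection at each latitude.
stretch <- 1 / cos(lat_deg * pi / 180)

data.frame(lat_deg = lat_deg, stretch = round(stretch, 2))
```

At 60 degrees latitude a region is drawn about twice as wide as it would be at the equator, which is why countries far from the equator look larger than they are.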

  b. Conic equal area projection with choropleth color log(Z). Analyze the map from step 3a and make conclusions. Compare maps from 3a and 3b and comment which advantages and disadvantages you may see with both types of maps.

Areas near the edges of the map are distorted and only become clear when we zoom in. The color scale still makes it easy to see which countries have the highest mosquito counts (the USA and Brazil are the darkest).

  4. In order to resolve problems detected in step 1, use data from 2013 only for Brazil and
  a. Create variable X1 by cutting X into 100 pieces (use cut_interval() )
  b. Create variable Y1 by cutting Y into 100 pieces (use cut_interval() )
  c. Compute mean values of X and Y per group (X1,Y1) and the amount of observations N per group (X1,Y1)
  d. Visualize mean X,Y and N by using MapBox

Identify regions in Brazil that are most infected by mosquitoes. Did such discretization help in analyzing the distribution of mosquitoes?

## # A tibble: 1,955 × 5
## # Groups:   X1 [90]
##    X1            Y1            mean_X mean_Y     N
##    <fct>         <fct>          <dbl>  <dbl> <int>
##  1 [-72.8,-72.4] (-8.21,-7.84]  -72.8  -7.96     1
##  2 (-71.2,-70.8] (-8.94,-8.57]  -70.8  -8.9      1
##  3 (-70.4,-70]   (-10.8,-10.4]  -70.0 -10.7      1
##  4 (-70,-69.6]   (-4.14,-3.77]  -69.7  -4.03     1
##  5 (-69.6,-69.2] (-10.8,-10.4]  -69.2 -10.7      1
##  6 (-69.6,-69.2] (-10.1,-9.68]  -69.4  -9.77     1
##  7 (-69.2,-68.8] (-1.56,-1.19]  -68.8  -1.39     1
##  8 (-68.8,-68.4] (-11.2,-10.8]  -68.6 -10.9      1
##  9 (-68.8,-68.4] (-10.8,-10.4]  -68.6 -10.6      1
## 10 (-68.8,-68.4] (-10.1,-9.68]  -68.4 -10        1
## # ℹ 1,945 more rows

We now plot fewer points, so there is less overplotting. This makes it easier to comment accurately on how the mosquitos are distributed across regions.
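To show what the discretization does, here is a base R sketch on a tiny example. It assumes equal-width bins, which is what cut_interval() produces; the coordinates, the number of bins and the helper `bin_equal` are hypothetical and chosen only to keep the example readable.

```r
# Hypothetical coordinates; the real analysis uses the 2013 Brazil data.
df <- data.frame(
  X = c(-72.8, -60.1, -60.0, -47.9, -34.8),
  Y = c( -8.0,  -3.1,  -3.0, -15.8,  -8.1)
)

# Hypothetical helper: cut a vector into n equal-width intervals.
bin_equal <- function(v, n) {
  cut(v, breaks = seq(min(v), max(v), length.out = n + 1), include.lowest = TRUE)
}

n_bins <- 4  # the assignment uses 100
df$X1 <- bin_equal(df$X, n_bins)
df$Y1 <- bin_equal(df$Y, n_bins)

# Mean position and number of observations per (X1, Y1) cell.
agg <- aggregate(cbind(X, Y) ~ X1 + Y1, data = df, FUN = mean)
agg$N <- aggregate(X ~ X1 + Y1, data = df, FUN = length)$X

agg
```

The two points that are close together end up in the same cell and are replaced by one dot at their mean position with N = 2, which is how the binned map reduces overplotting.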

The areas around ‘Recife’ and ‘Maceio’ have a high concentration of dots, especially blue and purple ones, which indicate high mosquito counts. The ‘Sao Paulo’, ‘Ribeirao Preto’ and ‘Londrina’ areas also show high numbers of mosquitos.

Assignment 2: Visualization of income in Swedish households

  1. Download a relevant map of Swedish counties from http://gadm.org/country and load it into R. Read your data into R and process it in such a way that different age groups are shown in different columns. Let’s call these groups Young, Adult and Senior.

  2. Create a plot in Plotly containing three violin plots showing mean income distributions per age group. Analyze this plot and interpret your analysis in terms of income.

From the box plots we can see that the income of the young group is much lower than that of the adult and senior groups, while senior income is slightly higher than adult income. A likely explanation is that people in the young category have little work experience and start at a lower income. As they gain experience and move into the adult category their income increases, and it then approaches a saturation point in the senior category, where income no longer rises as sharply as it does between the young and adult categories.
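The task asks for violin plots, while the analysis above is based on box plots. A minimal sketch of the violin variant follows, assuming a data frame with the columns young, adult and senior. The data frame `incomes` and its values are random placeholders, not the real incomes.

```r
library(plotly)

# Placeholder incomes, NOT the real figures; one row per county.
set.seed(1)
incomes <- data.frame(
  young  = rnorm(21, mean = 100, sd = 5),
  adult  = rnorm(21, mean = 150, sd = 10),
  senior = rnorm(21, mean = 160, sd = 10)
)

# One violin per age group; box = list(visible = TRUE) draws a small box plot inside.
p_violin <- plot_ly(incomes, y = ~young, type = "violin", name = "young",
                    box = list(visible = TRUE)) %>%
  add_trace(y = ~adult, type = "violin", name = "adult",
            box = list(visible = TRUE)) %>%
  add_trace(y = ~senior, type = "violin", name = "senior",
            box = list(visible = TRUE)) %>%
  layout(yaxis = list(title = "Income"))

p_violin
```

Compared with a box plot, the violin also shows the shape of each distribution, for example whether the county incomes in a group are skewed or have more than one peak.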

  3. Create a surface plot in Plotly showing dependence of Senior incomes on Adult and Young incomes in various counties. What kind of trend can you see and how can this be interpreted? Do you think that linear regression would be suitable to model this dependence?

We see a positive relationship: senior incomes tend to increase together with adult and young incomes across the counties.

The trend looks linear, so linear regression would be a suitable model for this dependence, since it assumes a linear relationship between the independent variables (adult and young) and the dependent variable (senior).
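One way to check this impression is to fit the regression and look at how much of the variation it explains. The sketch below uses base R's lm() on synthetic data constructed to be nearly linear; the values and coefficients are made up, and in the report the columns young, adult and senior of the real data frame would be used instead.

```r
# Synthetic incomes for 21 counties, made up for illustration only.
idx   <- 0:20
young <- 100 + 2 * idx
adult <- 300 + (7 * idx) %% 50
noise <- rep(c(-1, 0, 1), 7)
toy   <- data.frame(young, adult, senior = 20 + 0.5 * young + 0.9 * adult + noise)

# Linear model: senior income as a function of young and adult income.
fit <- lm(senior ~ young + adult, data = toy)

coef(fit)
summary(fit)$r.squared
```

An R-squared close to 1 and residuals without a clear pattern would support the linear model; a curved pattern in the residuals would suggest that a plane is not enough.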

  4. Use plot_geo function with trace “choropleth” to visualize incomes of Young and Adults in two choropleth maps. Analyze these maps and make conclusions. Is there any new information that you could not discover in previous statistical plots?

An interesting observation from the adult map is that income increases as we move from north to south. We do not see such a strong pattern for the young category, which may be because young people have less work experience to determine their income.

  5. Use GPS Visualizer http://www.gpsvisualizer.com/geocoder/ and extract the coordinates of Linköping. Add a red dot to the choropleth map for Young from step 4 in order to show where we are located :)

Appendix

Chunk Label: setup

knitr::opts_chunk$set(echo = TRUE)

Chunk Label: q1

library(plotly)
library(dplyr)
library(readr)

mosquito_data<-read.csv("aegypti_albopictus.csv")

data_2004 <- mosquito_data %>% filter(YEAR == 2004)
data_2013 <- mosquito_data %>% filter(YEAR == 2013)


Sys.setenv('MAPBOX_TOKEN' = 'pk.eyJ1IjoibmF6YmkwNTYiLCJhIjoiY20xNm1wdnl2MGgwNTJscXhuZzNzZmh2dSJ9.juf_tlUFGCUHvY9MCSX5lw')

p_2004<-plot_mapbox(data_2004)%>%add_trace(type="scattermapbox",lat=~Y, lon=~X,
                               color=~VECTOR,colors = c('red', 'blue'))%>%
  layout(
    title = "mosquito species distribution-2004"
  )

p_2013<-plot_mapbox(data_2013)%>%add_trace(type="scattermapbox",lat=~Y, lon=~X,
                                         color=~VECTOR,colors = c('red', 'blue'))%>%
  layout(
    title = "mosquito species distribution-2013"
  )

p_2004
p_2013

Chunk Label: q2

mosquito_count<- table(mosquito_data$COUNTRY_ID)

mosquito_count_country <- as.data.frame(mosquito_count)

colnames(mosquito_count_country) <- c("COUNTRY_ID", "Z")

g<-list(fitbounds="locations", visible=FALSE,projection = list(type = "equirectangular"))

p_geo<-plot_geo(mosquito_count_country)%>%add_trace(type="choropleth",
                                                    z = ~Z,locations = ~COUNTRY_ID,
                                                    colors = "Blues")%>%
  colorbar(title='Number of <br>Occurrences') %>% layout(geo=g)

p_geo

Chunk Label: q3.a

g_equirectengular<-list(projection = list(type = "equirectangular"))

p_geo_log<-plot_geo(mosquito_count_country)%>%add_trace(type="choropleth",
                                                    z = ~log(Z),locations = ~COUNTRY_ID,
                                                    colors = "Blues") %>%
  layout(geo=g_equirectengular)

p_geo_log

Chunk Label: q3.b

g_conic<-list(projection = list(type = "conic equal area"))

p_geo_conic<-plot_geo(mosquito_count_country)%>%add_trace(type="choropleth",
                                                    z = ~log(Z),locations = ~COUNTRY_ID,
                                                    colors = "Blues") %>% layout(geo=g_conic)

p_geo_conic

Chunk Label: q4.a

data_2013_brazil <- data_2013 %>% filter(COUNTRY_ID == "BRA")

X<-c(data_2013_brazil$X)

X1<-cut_interval(X,n=100)

Chunk Label: q4.b

Y<-c(data_2013_brazil$Y)

Y1<-cut_interval(Y,n=100)

Chunk Label: q4.c

new_df_x1_y1<-data.frame(X=X,Y=Y,X1=X1,Y1=Y1)

result <- new_df_x1_y1 %>%
  group_by(X1, Y1) %>%
  summarize(
    mean_X = mean(X),   
    mean_Y = mean(Y), 
    N = n()             
  )

result

Chunk Label: q4.d

p_mean<-plot_mapbox(result)%>%add_trace(type="scattermapbox", mode = 'markers',lat=~mean_Y, 
                                            lon=~mean_X,size=~N,
                                           color=~N,colors = c('red', 'blue'),
                                           text= ~paste("meanx:", mean_X, "<br>meanY:",mean_Y, "<br>N:", N),
                                           hoverinfo ='text') 

p_mean

Chunk Label: q0

data = read.csv("data.csv",header = TRUE)

Chunk Label: q2.11


library(stringr)
library(tidyr)

#setwd("~/Downloads/M A S T E R S /Sem-3/Part-1/Visualization/labs/lab-3")

county_map<-jsonlite::read_json("county.json")


data$region<-str_sub(data$region,4,-1)
data$region<-str_sub(data$region,1,-8)


data <- spread(data,"age","X2016")

colnames(data) <- c("region","young","adult","senior")


data$region[4]<- "Gävleborg"
data$region[6]<- "Jämtland"
data$region[7]<- "Jönköping"
data$region[11]<- "Skåne"
data$region[13]<- "Södermanland"
data$region[15]<- "Värmland"
data$region[16]<- "Västerbotten"
data$region[17]<- "Västernorrland"
data$region[18]<- "Västmanland"
data$region[19]<- "VästraGötaland"
data$region[20]<- "Örebro"
data$region[21]<- "Östergötland"

Chunk Label: q2.2

# 2) 
library(plotly)


data %>%
  plot_ly(y = ~young, type = "box", name = "young") %>%
  add_trace(y = ~adult, type = "box", name = "adult") %>%
  add_trace(y = ~senior, type = "box", name = "senior") %>%
  layout(yaxis = list(title = 'Income'))

Chunk Label: q2.3

 
library(akima)
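# with() avoids attach()/detach() and keeps the search path clean
s=with(data, interp(young,adult,senior, duplicate = "mean"))

# plotly reads the rows of z as y values, so the akima matrix (rows = x) is transposed;
# the axes of a 3D plot are titled through layout(scene = ...)
plot_ly(x=~s$x, y=~s$y, z=~t(s$z), type="surface")%>%
  layout(scene = list(xaxis = list(title = 'Young'), yaxis = list(title = 'Adult'), zaxis = list(title = 'Senior')))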

Chunk Label: q2.4


g=list(fitbounds="locations", visible=FALSE)

plot_geo(data)%>%add_trace(type="choropleth",geojson=county_map, locations=~region,z=~adult, featureidkey="properties.NAME_1")%>%layout(geo=g)

Chunk Label: 2.4

g=list(fitbounds="locations", visible=FALSE)

plot_geo(data)%>%add_trace(type="choropleth",geojson=county_map, locations=~region,z=~young, featureidkey="properties.NAME_1")%>%layout(geo=g)

Chunk Label: q2.5


g=list(fitbounds="locations", visible=FALSE)

plot_geo(data)%>%add_trace(type="choropleth",geojson=county_map, locations=~region,z=~young, featureidkey="properties.NAME_1") %>%
  # Linköping, drawn as a red dot as the task asks
  add_trace(type="scattergeo",lat=~58.4108, lon=~15.6214, mode="markers", marker=list(color="red", size=10)) %>%
  layout(geo=g)